ABSTRACT

A bounded region in $R^2$ with a uniform density function defined over it is partitioned into k sub-regions such that the within cluster sum of squares is minimized. An asymptotic ($k \to \infty$) lower bound for the within cluster sum of squares of this optimal k-means partition is obtained. This lower bound is useful in suggesting that the graph configuration of the optimal k-partition would consist of regular hexagons of equal size when k is large enough. An empirical study illustrating these asymptotic properties of bivariate k-means clusters is also presented.
1. INTRODUCTION

Let the observations $x_1, \ldots, x_N$ be sampled from a distribution F with density function f. In cluster analysis, the k-means clustering method (see Hartigan (1975), Chapter 4) is often used to partition the sample of N observations into k clusters with means $\bar{x}_1, \ldots, \bar{x}_k$. The resultant clusters satisfy the property that no movement of an observation from one cluster to another reduces the sample within cluster sum of squares

$$WSS_N = \sum_{i=1}^{N} \min_{1 \le j \le k} \| x_i - \bar{x}_j \|^2 \Big/ (N-k) .$$
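The definition of $WSS_N$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_wss` and the four toy observations are hypothetical and do not come from the paper.

```python
def sample_wss(points, means):
    """Sample within cluster sum of squares WSS_N, with divisor N - k."""
    n, k = len(points), len(means)
    total = 0.0
    for p in points:
        # Each observation contributes its squared distance to the nearest mean.
        total += min(sum((a - b) ** 2 for a, b in zip(p, m)) for m in means)
    return total / (n - k)


# Hypothetical toy data: two pairs of points, each pair centered on one mean.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
means = [(0.0, 1.0), (4.0, 1.0)]
print(sample_wss(points, means))
```

Every observation here lies at squared distance 1 from its nearest mean, so the numerator is 4 and the divisor is N - k = 2.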

For these k-means clusters, a k-partition of the sampled space can be defined by associating each cluster mean $\bar{x}_j$ with the convex polyhedron $C_j$ of all points in the sampled space closer to $\bar{x}_j$ than to any other cluster mean. The corresponding optimal k-means partition in the population F is defined by the population cluster means $\mu_1, \ldots, \mu_k$, which are selected in such a way that the within cluster sum of squares

$$WSS = \int \min_{1 \le j \le k} \| x - \mu_j \|^2 \, dF$$

is minimized.

For a fixed number of clusters k, the asymptotic convergence (as $N \to \infty$) of the sample k-means clusters to the population k-means clusters has been studied by MacQueen (1967), Hartigan (1978), and Pollard (1981). The most recent result can be found in Pollard (1981), in which conditions are found that ensure the almost sure convergence of the set of means of the k-means clusters. However, the asymptotic properties of k-means clustering in the case where the number of clusters k increases with the sample size N have not received much attention.

In Hartigan and Wong (1979a) and Wong (1980), some asymptotic properties (as $k \to \infty$) of the population k-means clusters in one dimension are obtained. It is shown that the within cluster sums of squares of the k clusters are asymptotically equal, and that the length of the jth cluster interval ($1 \le j \le k$) is inversely proportional to $f(c_j)^{1/3}$, where $c_j$ is the midpoint of the jth cluster interval. It is also shown that if $k(N) \to \infty$ as $N \to \infty$ with $k(N) = o(N/\log N)^{1/3}$, then the sample k-means clusters have asymptotic properties similar to those of the population clusters. Using these results, it can also be shown that a uniformly consistent histogram estimate of f, which is constant over each k-means cluster interval, can be constructed from the sample using the k-means method.

Unfortunately, these univariate results cannot be easily generalized to the multivariate case. Only empirical evidence exists to support the conjecture that similar asymptotic results hold in several dimensions for k-means clusters, and that a uniformly consistent histogram estimate of a multivariate density f can be constructed by the k-means method. The latter uniform consistency result is of special practical interest as it would justify the use of the computationally efficient k-means method (Hartigan and Wong, 1979b) for estimating multivariate density functions from large samples.

In this paper, some asymptotic properties of the population k-means clusters for uniform distributions in $R^2$ are given. In Section 2, an asymptotic lower bound for the WSS of the optimal k-means partition is obtained. Since this lower bound is attained when all k clusters of the partition are regular hexagons of equal area, this result suggests that the graph configuration of the optimal k-means partition would consist of regular hexagons when k is large enough. An empirical study is performed to illustrate these asymptotic properties of bivariate k-means clusters, and the results are given in Section 3.

The results given in this paper fall short of generalizing the asymptotic properties of the univariate population k-means clusters to the bivariate case. However, they are the first results obtained in the investigation of the properties of bivariate k-means clusters.
2. AN ASYMPTOTIC LOWER BOUND FOR THE WITHIN CLUSTER
SUM OF SQUARES OF K-MEANS CLUSTERS

In this paper, some asymptotic properties ($k \to \infty$) of the population k-means clusters for the uniform density in two dimensions are given. The main result is Theorem 2, which gives an asymptotic lower bound for the within cluster sum of squares of the optimal k-means partition.

Theorem 2:

Let $\mathcal{A}$ be a region of area A with a connected interior in $R^2$. Suppose that the boundary of $\mathcal{A}$ has finite length and let 1/A be the constant density over $\mathcal{A}$. Let WSS be the minimum within cluster sum of squares over all k-partitions. Then

$$\lim_{k \to \infty} WSS \Big/ \left( \frac{A}{k} \cdot \frac{5\sqrt{3}}{54} \right) = 1 .$$

(Remark: Since the asymptotic lower bound given in Theorem 2 is attained when all k clusters of the partition are regular hexagons of equal area, this result suggests that the graph configuration of the optimal k-means partition would consist of regular hexagons when k is large enough.)

In outline, the proof of Theorem 2 requires first showing that the polygon with n edges and area A which has the minimum within polygon sum of squares is regular (see Lemma 1). For any polygon divided into k clusters, a lower bound for the limiting value (lim inf) of the within cluster sum of squares may then be found by assuming the clusters are regular hexagons (Theorem 1). An upper bound may also be found by covering the polygon with regular hexagons. Since the ratio of the two bounds approaches 1 as $k \to \infty$, Theorem 2 follows.

Hence, to show the result in Theorem 2, we need the following lemmas:

LEMMA 1

If f is the constant density over an n-sided polygon $\mathcal{A}$ of area A, then a lower bound for $WPSS_n$, the within polygon sum of squares of $\mathcal{A}$, is given by

$$\frac{1}{2n} f A^2 \left[ \frac{1}{\tan(\pi/n)} + \frac{1}{3} \tan\frac{\pi}{n} \right] .$$

However, to prove Lemma 1, we need Lemma 1.1 and Lemma 1.2.

Lemma 1.1

For a triangle $\Delta$, with fixed area $A_0$ and fixed angle $\theta_0$ at the vertex V, the minimum value of $\int_{\Delta} r^2 \, d(\text{Area})$ (where r is the distance from V) is

$$\frac{1}{2} A_0^2 \left[ \frac{1}{\tan(\theta_0/2)} + \frac{1}{3} \tan\frac{\theta_0}{2} \right] ,$$

achieved when $\Delta$ is isosceles with equal edges adjacent to V.

Proof.

Consider the triangle VTS with angle SVT = $\theta_0$ and area $A_0$. Let M be the midpoint between the vertices T and S (see Fig. 1). Without loss of generality, let M be at the origin. Let e be the unit vector along the base ST (x-axis) and put $\|TM\| = t$. Also, let V be represented by the vector z. It follows that if ST is rotated by $d\theta$ about M to S'T' (see Fig. 1), the increment in $\int r^2 \, d(\text{Area})$, the second moment about V, is given by

$$\int_0^t \| z + x e \|^2 \, x \, dx \, d\theta - \int_0^t \| z - x e \|^2 \, x \, dx \, d\theta = \int_0^t 4 (z \cdot e) \, x^2 \, dx \, d\theta = \frac{4}{3} (z \cdot e) \, t^3 \, d\theta .$$

Fig. 1

Thus, this increment is positive or negative as $z \cdot e$ is positive or negative. Hence the minimum occurs when $z \cdot e = 0$; that is, when the triangle is isosceles.

Now, if we move towards the isosceles position by such a rotation, the area increases. Thus, for a given triangle VTS, we can decrease the second moment about V by first rotating to the isosceles position, and then sliding the base back towards V until the triangle has area $A_0$. Since the second moment about V for an isosceles triangle of area $A_0$ is equal to

$$\frac{1}{2} A_0^2 \left[ \frac{1}{\tan(\theta_0/2)} + \frac{1}{3} \tan\frac{\theta_0}{2} \right] ,$$

the lemma follows.
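The closed form for the isosceles triangle quoted at the end of the proof can be checked by direct integration. The sketch below assumes a coordinate system that is not in the original: V is placed at the origin, the axis of symmetry lies along the x-axis, and h (a symbol introduced here) denotes the height of the triangle from V to its base.

```latex
\begin{aligned}
A_0 &= \int_0^h 2x \tan\tfrac{\theta_0}{2} \, dx = h^2 \tan\tfrac{\theta_0}{2}, \\
\int r^2 \, d(\mathrm{Area})
  &= \int_0^h \int_{-x\tan(\theta_0/2)}^{x\tan(\theta_0/2)} (x^2 + y^2) \, dy \, dx
   = \int_0^h \Bigl[ 2x^3 \tan\tfrac{\theta_0}{2} + \tfrac{2}{3} x^3 \tan^3\tfrac{\theta_0}{2} \Bigr] dx \\
  &= \frac{h^4}{2} \Bigl[ \tan\tfrac{\theta_0}{2} + \tfrac{1}{3} \tan^3\tfrac{\theta_0}{2} \Bigr]
   = \frac{1}{2} A_0^2 \Bigl[ \frac{1}{\tan(\theta_0/2)} + \frac{1}{3} \tan\tfrac{\theta_0}{2} \Bigr].
\end{aligned}
```

The last step substitutes $h^4 = A_0^2 / \tan^2(\theta_0/2)$ from the first line.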

Lemma 1.2

Let VT and VS be two lines in $R^2$ with angle TVS = $\theta_0 < \pi$. Suppose that Q is a union of quadrilaterals (whose interiors are disjoint) all of whose vertices lie on VT and VS. Let the area of Q be $A_0$. Then $\int_Q r^2 \, d(\text{Area})$, the second moment about V, is minimized when Q is an isosceles triangle.
Proof.

[I] Fix an integer i and real number u, and let $\mathcal{Q}(i,u)$ be the set of plane figures such that every $Q \in \mathcal{Q}(i,u)$ is a union of quadrilaterals (whose interiors are disjoint), all of whose vertices lie on $VL_1$ or $VL_2$; $L_1$ and $L_2$ lie respectively on VT and VS, with $\|VL_1\| = \|VL_2\| = u$. Using the labeling system shown in Fig. 2, it is clear that every $Q \in \mathcal{Q}(i,u)$ is uniquely determined by the set of vertices $(x_1, \ldots, x_{4i})$, where the $x_j$'s are the distances of the vertices from V, ordered along each of the two lines and bounded above by u.

Fig. 2. An example of a $Q \in \mathcal{Q}(4,u)$. (The $x_j$'s are the distances from V.)

Thus, $\mathcal{Q}(i,u)$ can be identified with a compact subset of $[0,u]^{4i}$, from which it inherits a natural topology (pointwise convergence of the $x_j$'s). Under this topology, the two mappings $f_1 : \mathcal{Q}(i,u) \to R$ and $f_2 : \mathcal{Q}(i,u) \to R$, where $f_1(Q)$ = area of Q and $f_2(Q) = \int_Q r^2 \, d(\text{Area})$, are continuous on $\mathcal{Q}(i,u)$. Therefore, if $A_0 \le$ area of triangle $VL_1L_2$, the set

$$\mathcal{Q}_0(i,u) = \{ Q \in \mathcal{Q}(i,u) \text{ such that } f_1(Q) = A_0 \}$$

is nonempty and compact. It follows that $\min_{Q \in \mathcal{Q}_0(i,u)} f_2(Q)$ is attained by some $Q_0 \in \mathcal{Q}_0(i,u)$.

[II] Next, we will show that, for any i and u, $Q_0$ is an isosceles triangle. It is enough to show that the result holds when i = 2. Suppose that $Q_0$ were not an isosceles triangle. Using Lemma 1.1 to eliminate other cases, it is sufficient to consider the case when $Q_0 = \text{triangle } VBA \cup \text{triangle } AFH$ (see Fig. 3), where

(1) $\|VB\| > \|VA\|$, and

(2) $\|VF\| > \|VH\|$.

Consider the triangle AFH. Choose a point C on BF, close to F. Let the line through C parallel to VS cut the triangle AFH along the segment C*G*. Translate C*G* along this line to form a trapezium CGHA with $\|CG\| = \|C^*G^*\|$. By this construction (see Fig. 3), the trapeziums CGHA and C*G*HA have the same area, but

$$\int_{CGHA} r^2 \, d(\text{Area}) < \int_{C^*G^*HA} r^2 \, d(\text{Area}) .$$

Produce HG to intersect VT at D. Since $\|CG\| = \|C^*G^*\|$, triangle C*FG* has a larger area than triangle CDG. Part of the area of C*FG* can therefore be redistributed to complete the quadrilateral CDHA with a decrease in second moment. The remaining area can be added to triangle VBA to produce triangle VB*A. If $\|CF\|$ is small enough, all the points of triangle C*FG* are further from V than all the points of triangle ABB*; the second moment is thus decreased.

[III] From the results of [I] and [II], the minimum of $\int_Q r^2 \, d(\text{Area})$ over $Q \in \mathcal{Q}_0(i,u)$ is attained by $Q_0$, the isosceles triangle with equal edges adjacent to V. Notice that it is the same $Q_0$ for each $\mathcal{Q}(i,u)$. Thus $Q_0$ gives the minimum value of $\int r^2 \, d(\text{Area})$ over $\bigcup \mathcal{Q}_0(i,u)$.

Now, we can proceed to obtain the result given in Lemma 1.

Proof of Lemma 1.

[I] Consider a given n-sided polygon $\mathcal{A}$ of area A. By joining the centroid C and each of the n vertices of $\mathcal{A}$, we obtain an n-partition of $\mathcal{A}$ defined by the n cones radiating from C (see Fig. 4). Let $\mathcal{A}_i$ be the subset of $\mathcal{A}$ in the ith cone. Let $A_i$ be the area of $\mathcal{A}_i$ and let $2\theta_i$ ($< \pi$) be the angle subtended at C by the ith cone. Then $\sum_{i=1}^{n} A_i = A$ and $\sum_{i=1}^{n} \theta_i = \pi$. From Lemma 1.1 and Lemma 1.2 we have, for each $1 \le i \le n$,

$$\int_{\mathcal{A}_i} r^2 \, d(\text{Area}) \ge \frac{1}{2} A_i^2 \left[ \frac{1}{\tan\theta_i} + \frac{1}{3} \tan\theta_i \right] ,$$

where r is the distance from C. Summing over i, we have

$$(1) \quad WPSS_n = f \sum_{i=1}^{n} \int_{\mathcal{A}_i} r^2 \, d(\text{Area}) \ge \frac{1}{2} f \sum_{i=1}^{n} A_i^2 \left[ \frac{1}{\tan\theta_i} + \frac{1}{3} \tan\theta_i \right] .$$

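[II] Next, we will find the minimum of $\sum_{i=1}^{n} A_i^2 \left[ \frac{1}{\tan\theta_i} + \frac{1}{3} \tan\theta_i \right]$ under the constraints: (i) $\sum_{i=1}^{n} A_i = A$ and (ii) $\sum_{i=1}^{n} \theta_i = \pi$; $\theta_i < \pi/2$ for $1 \le i \le n$. Now, using Lagrange multipliers, the minimum must satisfy:

$$(2) \quad 2 A_i \left[ \frac{1}{\tan\theta_i} + \frac{1}{3} \tan\theta_i \right] = c_1 ,$$

$$(3) \quad A_i^2 \left[ -\frac{\sec^2\theta_i}{\tan^2\theta_i} + \frac{1}{3} \sec^2\theta_i \right] = c_2 ,$$

where $c_1$ and $c_2$ are constants independent of i. Thus, squaring (2) and then dividing by (3), we have

$$\left( 1 + \tfrac{1}{3} \tan^2\theta_i \right)^2 \tan^{-2}\theta_i \Big/ \left[ -(1 + \tan^{-2}\theta_i) + \tfrac{1}{3} (1 + \tan^2\theta_i) \right] = c_3 ,$$

where $c_3 = c_1^2 / (4 c_2)$, and hence

$$(4) \quad \left( \tfrac{4}{9} \tan^4\theta_i \right) \Big/ \left( 1 + \tfrac{2}{3} \tan^2\theta_i + \tfrac{1}{9} \tan^4\theta_i \right) = 1 + \frac{1}{c_3} .$$

Since the left side of (4) is a strictly increasing function of $\tan^2\theta_i$, $\tan^2\theta_i$ is a constant. But $\theta_i < \pi/2$ for all i, so we must have $\theta_i = \pi/n$ for all $1 \le i \le n$. Also, $A_i = A/n$ for all $1 \le i \le n$.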

[III] Using the result of [II], we have from (1) that

$$WPSS_n \ge \frac{1}{2} f \sum_{i=1}^{n} A_i^2 \left[ \frac{1}{\tan\theta_i} + \frac{1}{3} \tan\theta_i \right] \ge \frac{1}{2n} f A^2 \left[ \frac{1}{\tan(\pi/n)} + \frac{1}{3} \tan\frac{\pi}{n} \right] ,$$

and the equality holds when the given n-sided polygon $\mathcal{A}$ is regular, which gives the lemma.
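A short numerical check of Lemma 1 is sketched below, taking f = 1. The helper names (`second_moment_about_centroid`, `lemma1_bound`, `regular_polygon`) are hypothetical and introduced only for this illustration; the second moment is computed exactly from the vertices by summing signed triangle contributions, which is a standard polygon formula rather than anything taken from the paper.

```python
import math


def second_moment_about_centroid(vertices):
    """Return (area, integral of r^2 about the centroid) for a simple polygon."""
    m = len(vertices)
    edges = [(vertices[i], vertices[(i + 1) % m]) for i in range(m)]
    crosses = [p[0] * q[1] - q[0] * p[1] for p, q in edges]
    area = sum(crosses) / 2.0
    cx = sum((p[0] + q[0]) * c for (p, q), c in zip(edges, crosses)) / (6.0 * area)
    cy = sum((p[1] + q[1]) * c for (p, q), c in zip(edges, crosses)) / (6.0 * area)
    moment = 0.0
    for p, q in edges:
        # Shift the centroid to the origin, then add the signed contribution
        # of the triangle (origin, p, q): cross / 12 * (|p|^2 + p.q + |q|^2).
        px, py, qx, qy = p[0] - cx, p[1] - cy, q[0] - cx, q[1] - cy
        cross = px * qy - qx * py
        moment += cross * (px * px + py * py + px * qx + py * qy + qx * qx + qy * qy) / 12.0
    return abs(area), abs(moment)


def lemma1_bound(n, area, f=1.0):
    """Lower bound of Lemma 1 for an n-sided polygon of the given area."""
    t = math.tan(math.pi / n)
    return f * area ** 2 / (2.0 * n) * (1.0 / t + t / 3.0)


def regular_polygon(n, area):
    """Vertices of a regular n-gon with the given area, centered at the origin."""
    radius = math.sqrt(2.0 * area / (n * math.sin(2.0 * math.pi / n)))
    return [(radius * math.cos(2.0 * math.pi * j / n),
             radius * math.sin(2.0 * math.pi * j / n)) for j in range(n)]


for n in (3, 4, 6, 8):
    _, moment = second_moment_about_centroid(regular_polygon(n, 1.0))
    print(n, round(moment, 6), round(lemma1_bound(n, 1.0), 6))

# A 2 x 0.5 rectangle (area 1) is a non-regular 4-gon and lies above the bound.
rectangle = [(0.0, 0.0), (2.0, 0.0), (2.0, 0.5), (0.0, 0.5)]
print(second_moment_about_centroid(rectangle)[1], lemma1_bound(4, 1.0))
```

For regular polygons the two printed columns should coincide, which is the equality case of the lemma; for n = 6 and unit area the bound equals $5\sqrt{3}/54$.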

(Remark: In the application of this lemma, f is usually the constant density over a region containing the n-sided polygon $\mathcal{A}$. Thus, $f \le 1/A$ for most applications.)

Next, in Theorem 1, we will obtain a lower bound for the within cluster sum of squares of the optimal k-means partition of a polygon in $R^2$. However, it is important to first establish that it is sufficient to consider only "3-edge" k-partitions (k-partitions whose corresponding graph configurations have the property that all the interior vertices are associated with exactly 3 edges). Hence, we need Lemma 2.

Lemma 2

Let $\mathcal{A}$ be a region with connected interior in $R^2$. Suppose that the constant density over $\mathcal{A}$ is f and that $\mathcal{A}$ is partitioned into k regions. Then for every k-partition and for every $\epsilon > 0$, there exists a "3-edge" partition whose within cluster sum of squares differs from that of the given partition by at most $\epsilon$.

Proof.

Since the within cluster sum of squares, WSS, can be expressed in the form

$$WSS = \sum_{i=1}^{k} WSS_i = f \sum_{i=1}^{k} \int_{\mathcal{A}_i} r_i^2 \, d(\text{Area}) ,$$

where $r_i$ = distance to the ith cluster centroid and $\mathcal{A}_i$ is the ith cluster, it is clear that WSS is a continuous function of the vertices of the k-partition, for every fixed k. Since a vertex at which more than three edges meet can be split into 3-edge vertices by an arbitrarily small perturbation, the lemma follows.

Theorem 1:

Let $\mathcal{A}$ be a polygon with a connected interior in $R^2$. Suppose that $\mathcal{A}$ has area A and that f is the constant density over $\mathcal{A}$. Let WSS be the minimum within cluster sum of squares over all possible k-partitions of $\mathcal{A}$. Then

$$\liminf_{k \to \infty} WSS \Big/ \left( \frac{f A^2}{k} \cdot \frac{5\sqrt{3}}{54} \right) \ge 1 .$$

Proof.

Let $A_i$ be the area of the ith cluster and $E_i$ be the number of edges of the ith cluster.

[I] We will first obtain an expression for $\sum_{i=1}^{k} E_i$.

Consider the configuration of the optimal k-partition of $\mathcal{A}$. Using Lemma 2 and by choosing an arbitrarily small $\epsilon$, it is enough to examine a "3-edge" k-partition.

Let n be the number of vertices of the polygon $\mathcal{A}$. Then, using a continuity argument, it can be shown that it is enough to consider partitions with $n = n_2$, where $n_2$ is the number of vertices in the configuration of the given partition associated with exactly two edges. Let $n_3$ be the number of vertices associated with three edges in the configuration. Using some results in graph theory, we have

(1) $2E = 3n_3 + 2n$, where E is the total number of edges.

Moreover, Euler's formula gives

$E + 1 = F + V$, where F is the number of faces (clusters) and V is the number of vertices.

Therefore, from (1), we have

$$\tfrac{1}{2} (3n_3 + 2n) + 1 = k + (n_3 + n) ,$$

which gives

(2) $n_3 = 2(k-1)$.

Hence from (1) and (2),

$$(3) \quad \sum_{i=1}^{k} E_i = 2E - \text{number of edges around the perimeter} = 2E - (n_B + n) = 6(k-1) + n - n_B ,$$

where $n_B$ is the number of "3-edge" vertices on the boundary of $\mathcal{A}$.

(Relationship (3) holds for all partitions in which the vertices of the polygon $\mathcal{A}$ have two edges meeting them, and all remaining vertices have three.)
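Relations (1) to (3) can be sanity-checked on small hand-countable partitions. The sketch below uses a hypothetical helper `edge_counts` and two toy configurations chosen for this illustration (they do not appear in the paper): a square cut into two rectangles by one vertical segment, and a square cut into three vertical strips.

```python
def edge_counts(k, n, n_boundary):
    """Counts implied by relations (1)-(3) for a "3-edge" k-partition.

    k: number of clusters, n: number of polygon vertices (two edges each),
    n_boundary: number of 3-edge vertices lying on the polygon boundary.
    """
    n3 = 2 * (k - 1)                        # relation (2)
    total_edges = (3 * n3 + 2 * n) // 2     # relation (1): 2E = 3 n3 + 2 n
    sum_cluster_edges = 2 * total_edges - (n_boundary + n)   # relation (3)
    return n3, total_edges, sum_cluster_edges


# Square split into two rectangles: 2 junctions on the boundary, 7 edges,
# and each rectangle has 4 edges.
print(edge_counts(k=2, n=4, n_boundary=2))

# Square split into three vertical strips: 4 junctions on the boundary,
# 10 edges, and three rectangles with 4 edges each.
print(edge_counts(k=3, n=4, n_boundary=4))
```

In both cases the last entry equals $6(k-1) + n - n_B$, and $E + 1 = k + n_3 + n$ as Euler's formula requires.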

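[II] Next, we will find a lower bound for WSS.

Let $WSS_i$ be the within polygon sum of squares of the ith cluster. Then $WSS = \sum_{i=1}^{k} WSS_i$.

By Lemma 1, we have

$$WSS_i \ge \frac{1}{2 E_i} f A_i^2 \left[ \frac{1}{\tan(\pi/E_i)} + \frac{1}{3} \tan\frac{\pi}{E_i} \right] .$$

Therefore,

$$WSS = \sum_{i=1}^{k} WSS_i \ge \frac{1}{2} f \sum_{i=1}^{k} A_i^2 \, g(E_i) ,$$

where $g(E_i) = \frac{1}{E_i} \left[ \frac{1}{\tan(\pi/E_i)} + \frac{1}{3} \tan\frac{\pi}{E_i} \right]$ for $i = 1, 2, \ldots, k$.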

Now the minimum of $\sum_{i=1}^{k} A_i^2 g(E_i)$ with all the $E_i$'s being real numbers is not greater than its minimum with all the $E_i$'s being integers, when both are subjected to the constraints:

(i) $\sum_{i=1}^{k} A_i = A$ and (ii) $\sum_{i=1}^{k} E_i = 6(k-1) + n - n_B$.

Consider the minimum value of $\sum_{i=1}^{k} A_i^2 g(E_i)$ with all the $E_i$'s being real numbers. Using Lagrange multipliers, this minimum must satisfy

(iii) $A_i g(E_i) = c_1$, and (iv) $A_i^2 g'(E_i) = c_2$,

where $c_1$ and $c_2$ are constants independent of i. It follows from (iii) and (iv) that

$$\frac{g'(E_i)}{g(E_i)^2} = \frac{c_2}{c_1^2} = \text{constant for } i = 1, \ldots, k .$$

But it can be shown that $g'(E_i)/g(E_i)^2$, the first derivative of $-1/g$ at $E_i$, is a monotone decreasing function of $E_i$. Therefore, the minimum of $\sum_{i=1}^{k} A_i^2 g(E_i)$ must have

(v) $E_i = \sum_{j=1}^{k} E_j / k = 6 - (n_B + 6 - n)/k$ and (vi) $A_i = A/k$,

for all $1 \le i \le k$.
Thus,

$$(4) \quad WSS = \sum_{i=1}^{k} WSS_i \ge \frac{1}{2k} f A^2 \, g\big( 6 - (n_B + 6 - n)/k \big) .$$

Now, for $k > 4$, a lower bound of $n_B$ is 3. It follows that $6 - (n_B + 6 - n)/k \le 6(1 + (n-9)/(6k))$. Since $g(E_i)$ is a monotone decreasing function of $E_i$, $g(6 - (n_B + 6 - n)/k) \ge g(6(1 + (n-9)/(6k)))$. Therefore, from (4), we have

$$(5) \quad WSS \ge \frac{1}{2k} f A^2 \, g\big( 6(1 + (n-9)/(6k)) \big) .$$

Now for fixed n, $6(1 + (n-9)/(6k)) \to 6$ as $k \to \infty$. Therefore, since g is continuous and $\frac{1}{2} g(6) = 5\sqrt{3}/54$,

$$\liminf_{k \to \infty} WSS \Big/ \left( \frac{f A^2}{k} \cdot \frac{5\sqrt{3}}{54} \right) \ge 1 ,$$

and the theorem follows.
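The behaviour of g used in this proof can be inspected numerically. In the sketch below, `g` follows the definition given in the proof, while `theorem1_ratio` is a hypothetical helper introduced here that divides the right side of (5) by $(fA^2/k) \cdot 5\sqrt{3}/54$, so that f and A cancel.

```python
import math

HEX_CONSTANT = 5 * math.sqrt(3) / 54


def g(e):
    """g(E) = (1/E) [1/tan(pi/E) + (1/3) tan(pi/E)], for real E > 2."""
    t = math.tan(math.pi / e)
    return (1.0 / t + t / 3.0) / e


def theorem1_ratio(k, n):
    """Right side of (5) divided by (f A^2 / k) * 5 sqrt(3) / 54."""
    return 0.5 * g(6.0 * (1.0 + (n - 9) / (6.0 * k))) / HEX_CONSTANT


# g decreases with the number of edges, and g(6)/2 is the hexagon constant.
print([round(g(e), 5) for e in (3, 4, 5, 6, 7, 8)])
print(0.5 * g(6.0), HEX_CONSTANT)

# For a fixed polygon (here n = 4 vertices) the ratio approaches 1 as k grows.
for k in (10, 100, 1000, 10000):
    print(k, theorem1_ratio(k, 4))
```

The ratio tends to 1 for any fixed n, which is the limiting statement of Theorem 1.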
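Corollary

Let $\mathcal{A}$ be a region with a connected interior in $R^2$ whose boundary is of finite length. Let A be the area of $\mathcal{A}$ and let 1/A be the constant density over $\mathcal{A}$. Let WSS be the minimum within cluster sum of squares over all k-partitions of $\mathcal{A}$. Then

$$\liminf_{k \to \infty} WSS \Big/ \left( \frac{A}{k} \cdot \frac{5\sqrt{3}}{54} \right) \ge 1 .$$

Proof.

For each n, let $\mathcal{A}_n$ be an n-sided polygon of area $A_n$ approximating $\mathcal{A}$ from the inside. Then $A_n / A = 1 + \epsilon_n$, where $\epsilon_n \to 0$ as $n \to \infty$. Denote the minimum within cluster sum of squares over all k-partitions of $\mathcal{A}_n$ by $WSS_n$. Then, since $\mathcal{A}_n \subset \mathcal{A}$, $WSS_n \le WSS$. Thus, by putting f = 1/A, we have

$$\liminf_{k \to \infty} WSS \Big/ \left( \frac{A}{k} \cdot \frac{5\sqrt{3}}{54} \right) \ge \liminf_{k \to \infty} WSS_n \Big/ \left( \frac{A}{k} \cdot \frac{5\sqrt{3}}{54} \right) = (1 + \epsilon_n)^2 \liminf_{k \to \infty} WSS_n \Big/ \left( \frac{f A_n^2}{k} \cdot \frac{5\sqrt{3}}{54} \right) .$$

Letting $n \to \infty$, and using Theorem 1, we obtain

$$\liminf_{k \to \infty} WSS \Big/ \left( \frac{A}{k} \cdot \frac{5\sqrt{3}}{54} \right) \ge 1 .$$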


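Theorem 2

Under the hypothesis stated in the Corollary,

$$\lim_{k \to \infty} WSS \Big/ \left( \frac{A}{k} \cdot \frac{5\sqrt{3}}{54} \right) = 1 .$$

Proof.

Using the Corollary, it is sufficient to show

$$\limsup_{k \to \infty} WSS \Big/ \left( \frac{A}{k} \cdot \frac{5\sqrt{3}}{54} \right) \le 1 .$$

Now given the region $\mathcal{A}$, we can always construct a region $\mathcal{B}$ of area B consisting of k connected regular hexagons (see Fig. 5) such that

(1) $\mathcal{B} \supseteq \mathcal{A}$, and

(2) $\lim_{k \to \infty} B/A = 1$.

Fig. 5

Let $WSS_{\mathcal{B}}$ be the minimum within cluster sum of squares over all k-partitions of $\mathcal{B}$, and let 1/A be the constant density over $\mathcal{B}$. Then $\frac{5\sqrt{3}}{54} \cdot \frac{B^2}{Ak} \ge WSS_{\mathcal{B}}$, since the k regular hexagons form a k-partition of $\mathcal{B}$. Now, from (1), $WSS_{\mathcal{B}} \ge WSS$, and hence

$$\frac{5\sqrt{3}}{54} \cdot \frac{B^2}{Ak} \ge WSS .$$

Thus,

$$(3) \quad WSS \Big/ \left( \frac{5\sqrt{3}}{54} \cdot \frac{A}{k} \right) \le B^2 / A^2 ,$$

and the theorem follows from (2) and (3).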
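The value $\frac{5\sqrt{3}}{54} \cdot \frac{B^2}{Ak}$ used for the hexagonal partition follows from the equality case of Lemma 1 with n = 6. As a worked check, assuming each of the k hexagons has area B/k and the density is 1/A as in the proof:

```latex
\begin{aligned}
\text{one hexagon:}\quad
 & \frac{1}{2 \cdot 6} \cdot \frac{1}{A} \Bigl( \frac{B}{k} \Bigr)^{2}
   \Bigl[ \frac{1}{\tan(\pi/6)} + \frac{1}{3} \tan\frac{\pi}{6} \Bigr]
   = \frac{1}{12} \cdot \frac{B^2}{A k^2} \cdot \frac{10\sqrt{3}}{9}
   = \frac{5\sqrt{3}}{54} \cdot \frac{B^2}{A k^2}, \\
\text{all } k \text{ hexagons:}\quad
 & k \cdot \frac{5\sqrt{3}}{54} \cdot \frac{B^2}{A k^2}
   = \frac{5\sqrt{3}}{54} \cdot \frac{B^2}{A k}.
\end{aligned}
```

Here $\tan(\pi/6) = 1/\sqrt{3}$, so the bracket equals $\sqrt{3} + \sqrt{3}/9 = 10\sqrt{3}/9$.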


The result of Theorem 2 only gives a lower bound for the overall within cluster sum of squares of the k-means partition. It falls short of showing that the within cluster sums of squares of the k clusters are asymptotically equal (a conjecture due to Professor John A. Hartigan). Much work has yet to be done to prove the conjecture for distributions in two or more dimensions.

3. EMPIRICAL ILLUSTRATIONS

In order to illustrate the asymptotic properties of bivariate k-means clusters obtained in Section 2, an empirical study is performed using bivariate samples generated according to the uniform distribution on the unit square. It is necessary to estimate the within cluster sum of squares WSS for various values of k from generated samples because the WSS for the optimal k-means partition of the unit square cannot be obtained analytically for large values of k. Here, the results of three sets of experiments using different sample sizes are reported.

In Experiment One, four different samples of size N = 1500 are generated from the uniform distribution on the unit square. Using k = 40, 50, 60, and 70 for the different samples, unbiased estimates $WSS_N$ of WSS for the different numbers of clusters are obtained. The values of $WSS_N$ for the various values of k are given alongside the corresponding asymptotic lower bounds for WSS (that is, $\frac{5\sqrt{3}}{54} k^{-1}$) in Table 1, and the corresponding pairs are found to be in close agreement with one another. Similarly, three different samples of size N = 2500 are generated in Experiment Two, and the values of k used for the three samples are k = 50, 60, and 70; while in Experiment Three, the three generated samples are of size N = 4000, and the values of k used are k = 60, 80, and 100. The resulting $WSS_N$ values for these six experimental trials are also given in Table 1. Again, the values of $WSS_N$ for the various values of k are found to be in close agreement with the corresponding lower bounds for WSS. Hence, these empirical results tend to indicate that the asymptotic lower bound obtained in Theorem 2 is the WSS for the optimal k-means partition when k becomes large.
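An experiment of this kind can be sketched as follows. This is not the procedure used in the paper: plain Lloyd iterations are swapped in for the Hartigan and Wong (1979b) algorithm, and the seed, the iteration count, and the helper names `lloyd_means` and `sample_wss` are all assumptions made for this illustration. Because Lloyd iterations stop at a local optimum, the resulting ratio need not match the values reported in Table 1.

```python
import math
import random


def lloyd_means(points, k, iterations, rng):
    """Plain Lloyd iterations (an assumed stand-in for the paper's algorithm)."""
    means = rng.sample(points, k)
    for _ in range(iterations):
        sums = [[0.0, 0.0, 0] for _ in range(k)]
        for x, y in points:
            # Assignment step: index of the nearest current mean.
            j = min(range(k),
                    key=lambda c: (x - means[c][0]) ** 2 + (y - means[c][1]) ** 2)
            sums[j][0] += x
            sums[j][1] += y
            sums[j][2] += 1
        # Update step: centroid of each non-empty cluster; empty clusters keep their mean.
        means = [(s[0] / s[2], s[1] / s[2]) if s[2] else means[c]
                 for c, s in enumerate(sums)]
    return means


def sample_wss(points, means):
    """Sum of squared distances to the nearest mean, divided by N - k."""
    total = sum(min((x - mx) ** 2 + (y - my) ** 2 for mx, my in means)
                for x, y in points)
    return total / (len(points) - len(means))


rng = random.Random(0)          # fixed seed, chosen arbitrarily
n_obs, k = 1500, 40
points = [(rng.random(), rng.random()) for _ in range(n_obs)]
means = lloyd_means(points, k, iterations=25, rng=rng)
ratio = sample_wss(points, means) / (5 * math.sqrt(3) / (54 * k))
print(f"WSS_N / hexagon bound = {ratio:.3f}")
```

A ratio near 1 would be in line with the agreement described above; larger k or N can be tried at the cost of a longer run, since the sketch is written in pure Python for self-containment rather than speed.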
Table 1

Sample size (N)      k      $WSS_N$ ($\times 10^{-3}$)      $\frac{5\sqrt{3}}{54} k^{-1}$ ($\times 10^{-3}$)

     1500            40            4.059                        4.009
     1500            50            3.208                        3.208
     1500            60            2.686                        2.673
     1500            70            2.259                        2.291
     2500            50            3.253                        3.208
     2500            60            2.733                        2.673
     2500            70            2.393                        2.291
     4000            60            2.659                        2.673
     4000            80            2.035                        2.005
     4000           100            1.611                        1.604

The graph configurations of two of the sample k-means partitions are given in Figures 1 and 2. In Figure 1, the sample size used is 1500 and k = 50, while in Figure 2, the sample size used is 4000 and k = 100. Although only a few regular hexagons appear in these two configurations, it seems plausible that, when k is large enough, the graph configuration of the optimal k-means partition would consist mostly of regular hexagons.

Figure 1. Graph configuration of sample k-means partition (k = 50) obtained for 1500 observations from the uniform distribution.

Figure 2. Graph configuration of sample k-means partition (k = 100) obtained for 4000 observations from the uniform distribution.